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Abstract 

Unbinned maximum likelihood is a common procedure for parameter estima- 
tion. After parameters have been estimated, it is crucial to know whether the fit 
model adequately describes the experimental data. Univariate Goodness of Fit 
procedures have been thoroughly analyzed. In multi-dimensions, Goodness of Fit 
test powers have rarely been studied on realistic problems. There is no definitive 
answer to regarding which method is better. Test performance is strictly related to 
specific analysis characteristics. In this work, a review of multi-variate Goodness 
of Fit techniques is presented. 



1. Introduction 

An important task in particle physics analysis is to estimate how well the fit 
function approximates the measured observations. Goodness-of-fit (gof) tests per- 
form this task. The null hypothesis Hq is that the observed data follow some spec- 
ified probability density. 

A large number of univariate gof methods for unbinned data exist JI!]. Some of 
them can be extended to two or three dimensions. However, they typically use an 
ordering scheme which is not easily extensible to many dimensions. On the other 
hand,common gof tests for binned data, such as tests, can be theoretically used 
in many dimensions. Unfortunately, as the number of dimensions increases, they 
become quite inefficient, suffering from the "curse of dimensionality" [jij]. In fact, 
unless the data sample is extremely large, the majority of bins are empty. The goal 
of this work is to give a comprehensive review of multivariate gof techniques for 
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unbinned data. The power of gof methods varies greatly depending on the charac- 
teristics of the data, alternative hypotheses, goals of the analysis, etc... There does 
not exist a multivariate gof test that is always better. Therefore, the problem to 
address is not which multivariate unbinned gof technique is best, but rather which 
one is the best gof technique given different analysis configurations. 

In the gof procedure, a set of N sample data {x,}j^j is observed. The null hy- 
pothesis is that the random variable X follows a given distribution p{x). Given Hq 
is true, the probability of accepting the null hypothesis is defined as the confidence 
level of the test (1 — «/, where «/ is usually fixed to a few percent). On the other 
hand, given Hq is false, the probability of correctly rejecting the null hypothesis 
(1 — «//) is defined as the power of the test. In statistics, OC/ and an represent Type 
I and Type II errors. One fixes the Type I error, and then looks for the test with 
the highest power. An alternative hypothesis Hi is usually a composite hypothe- 
sis. That is, it gives very little information about the distribution of the data and 
is typically in the form of X ~ g(x), with ^(x) ^ p(x). In the case where more 
is known about the altemative hypothesis, that information should be included in 
the test in order to increase its power. 

Two-sample testing problem is strictly connected to gof and can as well be 
used to perform a fit test. To test the gof, a set of events can be randomly drawn 
from the null hypothesis distribution and a two-sample test can be performed. The 
size of the hull hypothesis sample should be large, but not too large with respect 
to the observed events [3]. 

A useful step before performing any multivariate gof test is to use the theorem 
due to Rosenblatt [4] that allows one to transform the original multidimension 
random vector with a given joint pdf, to a vector uniformly distributed on a mul- 
tidimension unit cube. However, in multidimensions the transformation to unifor- 
mity is not unique. Different transformations lead to different conclusions. For a 
detailed discussion and suggestions on multivariate transformation to uniformity, 
seeRef. {3\. 

In general, all multivariate gof methods described in this work do not specif- 
ically need a transformation to uniformity. However, a transformation to the unit 
hypercube is necessary to make the gof test equally sensitive to all regions of the 
observed space. 

2. Multivariate Tests for Multinomial Distributions 

Multivariate normality can be tested by considering separately each marginal 
variable [1]. Each marginal distribution is tested by itself using univariate gof 
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procedures. The presence of at least one non-normal distribution indicates non- 
normality of the multivariate distribution. Information about the univariate distri- 
butions can often give a good understanding about the multivariate distribution as 
well. For the significance of each univariate test, D'Agostino and Stephens (1986) 
suggest to use ajp where p is the number of dimensions and a the desired gof 
test significance. For a comprehensive review on how to combine significance 
levels, see Ref. 161]. An obvious drawback of this approach is that normality of 
all marginals does not imply joint normality. This is particularly disadvantageous 
considering that in gof tests, unlike the majority of hypothesis tests, one usually 
wants to accept the null hypothesis. 

Mardia (1970) is considered to be the first to have proposed a multivariate test 
for normality. Mardia suggests to test multinormality by estimating multivariate 
skewness and kurtosis. No significant skewness or kurtosis leads one to accept the 
null hypothesis [jvl] . 

Malkovich and Afifi (1973) also suggest to use skewness and kurtosis to check 
for multinormality. They propose to generalize the univariate jiieasures of skew- 
ness and kurtosis by using Roy's union-intersection principle [8]. The asymptotic 
distribution of the statistics used by these tests can be found in Ref. They also 
present a generalization of Shapiro and Wilk statistics for multivariate tests. 

Cox and Small (1978) propose to look for the linear combination of the pair 
of variables that maximizes the curvature of one variable when regressed on the 



other. The maximum curvature is the statistical test [llOll . 

Andrews et al. (1972) introduce the nearest distance test for joint normality. 
Firstly, points are transformed to the unit hypercube. Then, for each point, the 
smallest distance to another point is estimated. Finally, independence is tested 
between a specific transformation of the distances and the original points from 
which the transformation has been derived. Multiple regression techniques are 



used to test for independence [lllll 



A number of techniques have been proposed using scaled residuals, defined 
by Andrews et al. (1971) as 



Z,- = S-2(X,-X) (1) 

where 5^5 is the symmetric square root of the covariance matrix 5; X,, for 
/ = 1, are the n observations and X is the sample mean vector. The Z, are 
distributed symmetrically provided that original distribution is normal. Andrews 
et. al (1971) suggest to create a vector defined as a normalized weighted sum of 
Zi. Projections of the original observations onto the space defined by the vector 
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are univariate. Therefore, a univariate test can be employed lll2ll . 

From the scaled residuals, squared radii or angles can be derived. The squared 
radius is defined as 



^ = Z!Zi (2) 

and Qi is the angle made by Z, with a given line. Under the null hypothesis,the 
probability plots for radius and angle should be linear. Techniques based on radius 
and angle are typically used to estimate bivariate distributions. 

Other tests for multinormality have been proposed by Dahiya and Gurland 
(1973) yjl, Hensler et al. (1977) [14], Csorgo (1986) dlSi], Mudholkar et al. 
(1992) fief and Ghosh and Ruymgaart (1992) wk- 

Unfortunately, a comparison among the above-described tests has never been 
carried out. That makes it hard to give recomendations. Mardia's test is often 
considered the most reliable test for multinormality. D'Agostino and Stephens 
suggest combining Mardia's test with univariate tests for marginal normality to 
gain useful information about the multivariate distribution [1]. 



3. Kolmogorov-Smirnov Test 

The Kolmogorov-Smimov univariate test is an extremely popular gof proce- 
dure [HI]. Given a random sample x of size n, define F{x) as the Cumulative Distri- 
bution Function (CDF) under i?o, and define the Empirical Distribution Function 
(EDF) as 

# ohs. < X 

Fn{x) = , (3) 

n 

The Kolmogorov-Smimov test estimates the maximum absolute difference be- 
tween Fn{x) and F{x). That is, 

D = snp\F„{x)~F{x)\. (4) 



X 



Small values for D lead one to accept the null hypothesis. Other possible statis- 
tics can be created by just considering the maximum positive (sup^F„(x) —F{x)) 
or negative (sup^F(x) —Fn{x)) difference, as well as the sum of the maximum 
positive and negative difference (Kuiper's test) 

Justel et al. (1997) propose a generalization of the Kolmogorov-Smimov test 
in high dimensions that has become the most popular one. The extension of D in 
p dimensions becomes: 
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D„ = sup|F„(x)-F(x)|; (5) 

X 

The estimation of D„ is complicated for p > 2. Therefore, Justel et al. suggest 
to approximate D„ with 

D„ = sup|F„(x) -F(x)|; (6) 

xeA 

where A is a domain for F{x). For large n, the power of the exact and approx- 



imate tests is similar. Percentiles of these distributions can be found in Ref. [11911 . 
The Kolmogorov-Smimov univariate test is usually quite powerful. However, as 
pointed out in Ref. uilsl], it is inefficient in detecting small clusterings in the data. 
Kolmonorov-Smimov multivariate test is not as reliable as the univariate one. 

The Cramer-von Mises EDF statistic [JJ estimates the integrated quadratic 
discrepancy between F„(jc) and F{x), 

/oo 
{Fn{x)-F{x)fxj/{x)dF{xy, (7) 
-oo 



where i//(jc) is a suitable weighting function. See Ref. Ill] for a detailed de- 
scription of different statistics derived by different choices for V^(x). Univariate 
gof of fit tests based on the Cramer-von Mises statistic are widely used. However, 
a general and reliable extension to higher dimensions has not yet been proposed 
[[l9,1 . although the multivariate distribution of Cramer-von Mises statistic has been 



thoroughly studied [|2l|l 



A major issue with the extension into the higher dimensions of EDF tests 
is that it leads loss of sensitivity in some parts of the observable space. In the 
multivariate space, some studies have shown that EDF gof test power quickly 
degrades [l5[|22ll. 

3.1. MLV 

Maximum Likelihood Value (MLV) uses the value of the likelihood function 
at its maximum to check how well the functional form approximates the observed 
events. A detailed description of this method can be found in the BaBar Statistics 
Report (Barlow, 2002) as well as in Ref. fi33[ l34ll. The maximum Likelihood (ML) 
is used for parameter estimation. The rationale behind this method is that if the 
value of / at its maximum for the observed data is large, the fit is good. To estimate 
the relative magnitude of Imax-obs^ ^ series of Monte Carlo data sets are generated 
according to the known functional form with the fitted parameters. Then, Imax 
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is estimated for each of them and finally the distribution of the simulated /„ 



compared with Imax-obs ll33ll . That is, the gof is estimated by 



1 - a = 1 - / /(^l^o) dx, (8) 

where x in our case is Imax, Xobs is the value measured in the fit to the data, and 
/(x|//o) is the distribution under Hq. 

Heinrich (2001) and Narsky (2003) show in different papers examples in which 
the MLV procedure fails to answer the question of how well the data are modeled 
by a certain density. The strong correlation between ML estimators and MLV 
leads one to accept the null hypothesis, even in cases in which the null hypothe- 
sis is clearly false. In practice, MLV tests the null hypothesis distribution against 
another distribution from the same family, not against any other possible distribu- 
tion. That is, if data are modeled as a certain Gaussian distribution, MLV gives no 
information about the possibility that the sample data were non-Gaussian. In the 
conclusion of his work, after performing a series of tests, Heinrich states: "My 
suspicion is that MLV never works (that is, it can't discover if the data does not 
match the form of the fitted distribution), but in any case, I would not use the 



method for a particular distribution without proof of validity " [|36|] . Narsky and 



metnod tor a particular distribution witnout pro ( 
Lyons' (2008) conclusions are similar 

3.2. Nearest Neighbors 



Another popular gof procedure is to use nearest neighbors. Unlike ^iid in 
general the majority of other gof techniques, methods based on Euclidean distance 
between nearest measured events are not greatly affected by the "curse of dimen- 



sionality" Il23n . That is, they can be more easily applied to an arbitrary number of 
dimensions. A first model to test gof using nearest neighbors has been suggested 
by Clark and Evans in 1954 for bivariate distributions ri24ll and by the same au- 
thors in 1979 for any dimension [25J. They propose a test based on the average 
distance between nearest neighbors in a region of space A. 

Diggle (1979) points out that Clark and Evans' approach ignores the inher- 
ent dependencies among the distances, which can lead to wrongly rejecting the 
null hypothesis. Diggle proposes a different nearest neighbor gof model that 
matches the marginal distribution function of nearest neighbors with the EDF. 
That is, one applies a univariate gof test such as Kolmogorov-Smimov or Cramer- 
von Mises using as an EDF an entire distribution of ordered distances to nearest 
neighbors [|26ll . 
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Ripley (1977) proposes to use a function K{t) that gives the number of points 
within distance t from a given point. The gof is tested by estimating the maximum 
difference between the measured and the expected value of K{t). 

Bickel and Breiman (1983) introduce a multidimensional gof test that consid- 
ers the distribution of the variable 



W, = exp{-ng{X,)V{R,)}, (9) 

where g{Xi) is the hypothesized density at the point X,, Ri is the distance from 
Xj to its nearest neighbor, and V{r) is the volume of a nearest neighbor sphere cen- 
tered atXi. Later Schilling (1983) generalizes this statistic by adding some specific 
weights that allow the test to better discriminate against prespecified contiguous 
alternatives of g{Xi) [28]. An approximation of both these statistics is given by 
Schilling for applications in which the number of dimensions is extremely high 



[12911 . Voronoi regions can be used as well instead of spheres [30]. 

Narsky (2003) suggests a nearest neighbor gof test that considers minimal and 
maximal cluster size. The average distance J- "'^ from the center of the cluster to m 
nearest neighbors is taken as a measure of cluster size. Ref. iSTI] gives suggestions 
on optimization of parameter m. The probability of observing the smallest and the 
largest cluster is used to test gof. 

A large number of nearest neighbor related procedures have been applied to 
two-sample multivariate testing problems, see, for example, Ref. [32]. The ratio- 
nale behind two-sample tests is similar to gof tests. The two data sets are mixed 
together and tests are typically based on counting the number of nearest neighbor 
events that belong to the same sample. 

Gof tests based on nearest neigbors are particularly suitable for localizing 
maximal deviations or small clusterings in the data, i.e. unexpected peaks in the 
data. That is, the kind of analysis in which the Kolmogorov-Smimov performance 
is weak. For a comparison between Narsky 's distance to nearest neighbor and 
Kolmogorov-Smimov performance in multi-dimension, see Ref. [5]. The nearest 
neighbor approach performs better for every test, showing a remarkable versatility 
against different alternative hypotheses. However, Narsky points out that "by no 
means it should be expected to provide the best discrimination against every alter- 
native hypothesis". When more is known about the alternative hypothesis, other 
tests are likely to achieve a better performance. 
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3.3. Neyman Smooth Tests 

Neyman's smooth gof test (1937) is historically considered to be the first 
smooth test of gof (although later it was proved that Pearson's was a smooth 
test as well). Unlike other tests discussed in this work, the Neyman test specifies 
the alternative hypothesis. That is, they define the alternative to f{x) as 

k 

g{x) = c(0)exp[£ eihi{x)]f{x), (10) 
1=1 

where C is a normalization function, Q are free parameters, and hi{x) are a 
set of orthonormal functions. If the null hypothesis is a uniform distribution, hi{x) 
are Legendre polynomials. For suggestions on how to choose hi{x) when checking 



for different distributions, see Ref. [|37|]. The Gof statistic is derived from the Rao 
statistic, that is 

lE/^.W]', (11) 

the null hypothesis is rejected for large values of the statistic. Crucial for 
the performance of Neyman smooth tests is the optimization of k. For a de- 
tailed overview about the possible choices of k, see Ledwina (1994). In her work. 



she suggests to use Schwarz's Bayesian information criterion BSll . Notably, she 
shows how choosing the right value for k dramatically improves test power. For 
an example of the power of the Neyman smooth gof test applied to multivariate 
physics data, see Ref. [i39J. The Neyman multivariate smooth test performs sim- 
ilarly to the Mardia's test. No indication about the optimization of the statistic 
parameter is given. 

3.4. Tree/Classifier Based Approach 

Friedman and Rafsky (1979) propose a generalization of the Wald-Wolfowitz 
run test based on Minimum Spanning Tree (MST). Although the MST test is the- 
oretically applicable to any univariate test based on an EDF, it achieves the best 
performance when used in conjunction with the Wald-Wolfowitz test. The Wald- 
Wolfowitz univariate two-sample test sorts all events in ascending order. Then, it 
labels each event depending on its original sample. Defining a run as a sequence 
of consecutive events with the same label, the null hypothesis is rejected for small 
run lenghts ll40ll . In a multivariate MST, a tree that connects all points without 
allowing closed circles is built. Events connected by a link are the closest events; 
that is, they are the nearest neighbors. In fact, the MST test is, from a theoretical 
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point of view, a nearest neighbor method. When performing the Wald-Wolfowitz 
test, links between events from different samples are removed. The null hypoth- 
esis is rejected for small numbers of sub-trees generated by this procedure [ 22ll . 
Interestingly, Friedman and Rafsky performed a series of power studies compar- 
ing the MST Wald-Wolfowitz test with a Kolmogorov-Smimov multivariate gen- 
eralization. In one dimension, the Kolmogorov-Smimov method is well-known 
to have a better performance and experimental results confirm it. However, as 
the number of dimensions increases, the difference between the two test perfor- 
mance decreases, and for number of dimensions > 5 Wald-Wolfowitz outperforms 
Kolmogorov-Smimov. This, once again, proves that the notion of rank is hardly 
extensible to high dimensions 1I22I1 . 

In 2003, Friedman presented a test based on any possible machine leaming 
classifier. However, Friedman suggests choosing a decision tree as a classifier 
In the case of rejection of the null hypothesis, a tree allows one to visualize the 
region of space in which observed data and the fit model differ. The procedure 
when a tree is used as a classifier is briefly decribed here, however the rationale is 
the same for any other machine leaming classifier. 

Different labels (-1-I/-I) are assigned to data from different samples. Then, a 
tree is built to separate the samples. Various optimization criteria to create a tree 
exist, Gini index is discussed here. The value of the Gini index associated with 
the original data is used to test gof. To derive the Gini index distribution, class 
labels are randomly permuted among the data many times and the resulting Gini 
indexes are estimated as for the original samples. The fraction of experiments that 
gives a negative Gini index higher than the original one is used to compute gof. 

As Friedman himself notes in his work, a potential disadvantage of this method 
is that a machine leaming classifier is very likely to find differences between sam- 
ples, given enough number of events. That is, even small differences between 
observed and expected data could lead to reject the null hypothesis [3]. A useful 
description of a practical application of Friedman's test can be found in the doc- 



umentation of the statistical package StatPattemRecognition (SPR) HlW . Advice 
on the optimization of classifier parameters, specifically for the gof procedure, are 
given in the SPR documentation as well. 

3.5. Distance Based Approach 

Cuadras et al. (1995,1997, 2003) introduce a multivariate Distance Based 
(DB) gof test based on the squared distance between two distributions. Given two 
samples of dataX and Y, they define the geometric variability of X as 
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VM = \I d\x,x')f{x)f{x')X{dx)X{dx'), (12) 

where d{x,x') is a distance function on the space S and A a specific measure 
chosen according to the characteristics of both vectors. V'rf(X), a version of Rao's 
quadratic entropy, is a generalization of the variance of X with respect to dissimi- 
larity d. 

The squared distance between samples X and Y is defined as 



A2(ni,n2) = j^^d'{x,y)f{x)g{y)X{dx)X{dy)-Vd{X)-Vd{Y), (13) 

where Di and 112 are the two populations represented as samples X, Y with 
density functions f{x) and g{y). is a Jensen difference that can be interpreted 
as a distance between two mean vectors [|43U44ll . 

The sampling version of DB test is then: 



yd{x) = ^f^f^d\x,,xj), 

^"1 ,=ij=i 



(14) 



and 



1 n\ "2 

A2(ni,n2) = — ££^'(x„3^,0-yrf(x)-yrf(3;). (i5) 

«in2,tl JTl 

Obviously, small values of the squared distance indicate strong similarity be- 
tween the two samples. In fact, DB tests the null hypothesis A^ = 0. In Ref. [|44ll . 
several applications of the DB test are presented, highlighting its versatility. 

Independently from Cuadras et al., Zech and Asian (2003-2005) suggest a 
multivariate test that can be viewed as a generalization of the DB one. They 
propose the Energy test whose two-sample statistic is 



^n,n2 = -l,R{\x:-Xj\) + -'£R{\yi-yj\) + R{\x, ~ yj\) , (16) 

where R{r) is a continuous, monotonic function of the Euclidean distance be- 
tween the vectors. They propose different functions for R{r). The best function 
depends on the characteristics of the data. However, they find that R = —ln(r + e) 
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is generally the best choice, offering a good rejection power against many differ- 
ent alternatives. Here, e is a cut-off introduced to improve the power of the test 
by avoiding problems caused by extremely small values of the distance. For a 
detailed description of possible distance functions as well as for a mathematical 
definition of the cut-off, see Ref. [45] . Thus, Zach and Asian suggest to compare 
two samples by using a logarithmic measure of point to point dissimilarity. 

Remarkably, Zach and Asian conducted extensive studies about power of their 
test in different configurations using physics data. In one dimension, the energy 
test is competitive with the most powerful univariate tests. In multidimensions, the 
energy test performs significantly better than any other test it has been compared 
with: Mardia's test, Neyman smooth test, a nearest neighbors test and Friedman- 
Rafsky test. Specifically, the best relative performance is achieved in the highest 
dimension (D=4). 

Lyons (2008), after reviewing the weaknesses of EDF and tests in multidi- 
mension, seems to consider DB/Energy tests as the most reliable choice in high 
dimensions. 

3.6. Summary 

Univariate gof methods have been extensively studied and analyzed. In high 
dimensions, a significant number of gof tests for unbinned data have been pro- 
posed. However, an accurate study of their power, in different configurations, has 
never been performed. Ideally, one would look for a test that is uniformly most 
powerful. In practice, such a test does not exist.The test power depends on the 
specific problem. Therefore, the choice of the test should depend on the analysis 
characteristics. Nearest neighbor tests seem a good choice when the alternative 
hypothesis is composite. Energy /DB tests have been shown to achieve good re- 
sults in high dimensions. However, there is no ultimate "best" gof test. It would 
be very useful to check different gof tests on several different alternatives in order 
to gain an understanding of the circumstances in which a specific gof test works 
well and those when it does not. 
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